Information filtering in high velocity text streams using limited memory: an event-driven approach to text stream analysis

نویسنده

  • Andreas Bauer
چکیده

This dissertation is concerned with the processing of high velocity text using event processing means. It comprises a scientific approach for combining the area of information filtering and event processing, in order to analyse fast and voluminous streams of text. In order to be able to process text streams within event driven means, an event reference model was developed that allows for the conversion of unstructured or semi-structured text streams into discrete event types on which event processing engines can operate. Additionally, a set of essential reference processes in the domain of information filtering and text stream analysis were described using eventdriven concepts. In a second step, a reference architecture was designed that described essential architectural components required for the design of information filtering and text stream analysis systems in an event-driven manner. Further to this, a set of architectural patterns for building event driven text analysis systems was derived that support the design and implementation of such systems. Subsequently, a prototype was built using the theoretic foundations. This system was initially used to study the effect of sliding window sizes on the properties of dynamic sub-corpora. It could be shown that small sliding window based corpora are similar to larger sliding windows and thus can be used as a resource-saving alternative. Next, a study of several linguistic aspects of text streams was undertaken that showed that event stream summary statistics can provide interesting insights into the characteristics of high velocity text streams. Finally, four essential information filtering and text stream analysis components were studied, viz. filter policies, term weighting, thresholds and query expansion. These were studied using three temporal search profile types and were evaluated using standard performance measures. The goal was to study the efficiency of traditional as well as new algorithms within the given context of high velocity text stream data, in order to provide advise which methods work best. The results of this dissertation are intended to provide software architects and developers with valuable information for the design and implementation of event-driven text stream analysis systems.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Event based classification of Web 2.0 text streams

Web 2.0 applications like Twitter or Facebook create a continuous stream of information. This demands new ways of analysis in order to offer insight into this stream right at the moment of the creation of the information, because lots of this data is only relevant within a short period of time. To address this problem real time search engines have recently received increased attention. They tak...

متن کامل

Using Stream Features for Instant Document Filtering

In this paper, we discuss how event processing technologies can be employed for real-time text stream processing and information filtering in the context of the TREC 2012 microblog task. After introducing basic characteristics of stream and event processing, the technical architecture of our text stream analysis engine is presented. Employing wellknown term weighting schemes from document-centr...

متن کامل

High capacity steganography tool for Arabic text using 'Kashida'

Steganography is the ability to hide secret information in a cover-media such as sound, pictures and text. A new approach is proposed to hide a secret into Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover-media. To approach this, some algorithms have been des...

متن کامل

A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...

متن کامل

Temporal and Information Flow Based Event Detection from Social Text Streams

Recently, social text streams (e.g., blogs, web forums, and emails) have become ubiquitous with the evolution of the web. In some sense, social text streams are sensors of the real world. Often, it is desirable to extract real world events from the social text streams. However, existing event detection research mainly focused only on the stream properties of social text streams but ignored the ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016